Abstract

Scoring rules are an important tool for evaluating the performance of 
probabilistic forecasting schemes. In the binary case, scoring rules (which 
are strictly proper) allow for a decomposition into terms related to the 
resolution and to the reliability of the forecast. This fact is particularly 
well known for the Brier Score. In this paper, this result is extended 
to forecasts for finite-valued targets. Both resolution and reliability are 
shown to have a positive effect on the score. It is demonstrated that 
resolution and reliability are directly related to forecast attributes which 
are desirable on grounds independent of the notion of scores. This finding
can be considered an epistemological justification of measuring forecast
quality by proper scores. A link is provided to the original work of
DeGroot and Fienberg (1982), extending their concepts of sufficiency and
refinement. The relation to the conjectured sharpness principle of
Gneiting et al. (2005a) is elucidated.



1 Introduction 



Brown (1970) argues that it seems reasonable to value forecasts (be they
probabilistic or other) by a scheme related to the extent to which the
forecasts "come true". Scoring rules provide examples of such schemes in the
case of probabilistic forecasts. After pioneering work by Good (1952) and
Brier (1950), scores were thoroughly investigated in the 1960s and 1970s.
The score was effectively thought of as a reward system, inducing (human)
experts to provide their judgments or predictions regarding uncertain events
in terms of probabilities (Brown, 1970; Savage, 1971). In this respect,
scoring rules were devices to elicit probabilities from humans. The
importance of using proper scores was recognized already by Brier (1950)
(see also Brown, 1970, for an entertaining discussion and "some horrible
examples"). The central argument is that a forecaster's probability
assignment should be independent of the particular reward system, which is
guaranteed if the reward system constitutes a proper score. Savage (1971)
(following de Finetti, 1970) points out that this universality property
allows for an alternative definition of subjective probability, which is a
concept of probability independent of the notion of relative observed
frequency.

Owing to the enormous increase in computer power over the last decades, it
became computationally feasible to numerically produce probabilistic
forecasts for dynamical processes, employing models of ever increasing
complexity. Since it is obviously irrelevant whether probabilities are
produced by humans or machines, scores provide a tool to evaluate
probabilistic numerical forecasting systems, too. In weather forecasting,
scores had already been used to evaluate subjective forecasts (issued by
expert meteorologists), for example of rain, long before numerical weather
forecasts became available (Brier, 1950; Winkler and Murphy, 1968;
Epstein, 1969; Murphy and Winkler, 1977). Nowadays, [...] probabilistic
weather forecasts (Gneiting et al., 2005b; Gneiting and Raftery, 2007;
Bröcker et al., 2004; Raftery et al., 2005; Wilks, 2006a; Bröcker and
Smith, 2008).



In contrast to the expert-judgment forecasts considered in earlier works on
scores, weather forecasts are often issued over a long period of time under
(more or less) stationary conditions, allowing for archives of
forecast-observation pairs to be collected. This fact allows us to
reconsider the interpretation of probabilities as long-run observed
frequencies. If we were to forecast the probability of rain on a large
number of occasions, we would like rain to occur on a fraction p of those
instances where our forecast was (exactly or around) p. A forecast having
this property (up to statistical fluctuations) is called reliable (Murphy
and Winkler, 1977; Toth et al., 2003; Wilks, 2006b). If a large archive of
forecast-observation pairs is available, reliability becomes a sensible
property to ask for. As has been widely noted previously, though, it is not
difficult to produce reliable forecasts if no constraint is put on the
information content or resolution of the forecast (the exact meaning of
these terms is often left vague). In any event, the grand probability (aka
climatological frequency) of the target will always be a reliable forecast,
and despite the difficulties with the term "information content", many
people would presumably agree that this forecast is not very informative.

But how do these virtuous forecast attributes pertain to proper scores? Do
proper scores reward reliable forecasts? Does a "better informed" forecaster
really achieve a better score? In this paper, these questions are answered
in the affirmative (using the appropriate formalisation of "better
informed"). In Section 2, after recalling the notion of reliability, it is
shown that proper scores allow for a decomposition into terms measuring the
resolution and the reliability of the forecast. In particular, reliability
turns out to have a direct positive impact on the score. In Section 3 the
concept of sufficiency is introduced, generalising similar notions of
DeGroot and Fienberg (1982). Sufficiency formalises the idea of "being more
informed", and is shown to have a direct positive impact on the resolution
term of the score.

The decomposition of Section 2 is well known for the Brier score (see for
example Murphy and Winkler, 1987; Murphy, 1996; Blattenberger and Lad,
1985), a widely used score for forecasting problems with only two
categories. The Brier score presumably owes much of its popularity to this
decomposition, rendering its interpretation very clear. DeGroot and
Fienberg (1982) have derived a similar decomposition for any proper score in
the case of binary targets. (This result seems not to be widely known in the
atmospheric sciences community, and I became aware of it rather belatedly
during the preparation of this manuscript.) The relation to the conjectured
sharpness principle of Gneiting et al. (2005a) is elucidated. The appendix
contains several more technical points. Appendix A provides an equivalent
characterisation of reliability. In Appendix B the equivalence between
sufficiency according to DeGroot and Fienberg (1982) and as used in this
paper is shown. Finally, the derivation of the decomposition (15) is
presented in Appendix C.



2 A general decomposition 

In this section, a general decomposition of proper scores will be derived.
To facilitate the discussion, some convenient notation will be introduced
first, supplemented with a brief reminder on proper scores. Let $Y$ denote
the quantity to be forecast, commonly referred to as the observation or
target.^1 The observation $Y$ is modelled here as a random variable taking
values in a set $E$. For the sake of simplicity, $E$ is assumed to be a
finite set of alternatives (e.g. "rain/hail/snow/sunshine"), labelled
$1, \ldots, K$. Values of $Y$ (i.e. elements of $E$) will be denoted by
lowercase letters like $x$, $y$, or $z$. A probability assignment over $E$
is a $K$-dimensional vector $p$ with nonnegative entries so that
$\sum_{k \in E} p_k = 1$. The set of all probability assignments over $E$ is
denoted by $\mathcal{P}_E$. Elements of $\mathcal{P}_E$ will be denoted by
$p$, $q$, and $r$. A probabilistic forecasting scheme is a random variable
$\gamma$ with values in $\mathcal{P}_E$. In other words, the realisations of
$\gamma$ are probability assignments over $E$. The reason for assuming
$\gamma$ to be random is that forecasting schemes usually process
information that becomes available before or at forecast time. For example,
if $\gamma$ is a weather forecasting scheme with lead time 48h, it will
depend on weather information available up to 48h prior to when the
observation $Y$ obtains. The task of designing a forecasting scheme is
effectively to model the relationship between this side information and what
is to be forecast (see Murphy and Winkler, 1987; Murphy, 1993, 1996 for a
related discussion).^2

1 I use italics to indicate that an expression is to be considered a
technical term.
2 I do not consider forecasting problems which are explicitly dependent on
time, for example to take into account seasonal effects.

It was already mentioned what reliability means in the case that $E$
contains only two elements (1 and 0, say). In the case of more than two
alternatives, this definition of reliability generalises as follows: On the
condition that the forecasting scheme is equal to, say, the probability
assignment $p$, the observation $Y$ should be distributed according to $p$,
or in formulae

$$ \mathbb{P}(Y = k \mid \gamma = p) = p_k \tag{1} $$

for all $k \in E$. In particular, a reliable forecasting scheme can be
written as a conditional probability. As is demonstrated in Appendix A, the
reverse is also true: every conditional probability of $Y$ is reliable. In
view of Equation (1), I will fix the notation
$\pi^\gamma_k := \mathbb{P}(Y = k \mid \gamma)$, $k = 1, \ldots, K$, for the
conditional probability of the observation given the forecasting scheme.
Like every conditional probability, $\pi^\gamma$ is a random quantity.
Hence, $\pi^\gamma$ is a probabilistic forecasting scheme like $\gamma$
itself. In terms of $\pi^\gamma$ and $\gamma$, the reliability condition (1)
can be written simply as $\pi^\gamma = \gamma$. Since $\pi^\gamma$ is
reliable, it trivially holds that $\pi^\gamma = \pi^{(\pi^\gamma)}$. In any
case, $\pi^\gamma$ is a function of $\gamma$, independent of whether
$\gamma$ is reliable or not.
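
To illustrate the reliability condition (1), the following minimal Python
sketch estimates the conditional distribution of the observation given the
forecast from an archive of forecast-observation pairs and compares it with
the forecast itself. The archive, the helper name `conditional_distribution`,
and the 0-based category labels are assumptions made purely for illustration;
with a real archive, forecasts would typically have to be binned and
statistical fluctuations taken into account.

```python
from collections import defaultdict

def conditional_distribution(archive, K):
    """Relative frequencies of Y among all occasions sharing the same forecast."""
    counts = defaultdict(lambda: [0] * K)
    for forecast, y in archive:
        counts[forecast][y] += 1
    return {f: [c / sum(cs) for c in cs] for f, cs in counts.items()}

# Hypothetical archive of (forecast, observation) pairs with categories 0 and 1.
archive = (
    [((0.8, 0.2), 0)] * 8 + [((0.8, 0.2), 1)] * 2    # Y = 1 on 2 of 10 occasions
    + [((0.5, 0.5), 0)] * 3 + [((0.5, 0.5), 1)] * 7  # Y = 1 on 7 of 10 occasions
)

estimate = conditional_distribution(archive, K=2)
for forecast, frequencies in estimate.items():
    # Largest discrepancy between forecast probability and observed frequency.
    gap = max(abs(p - f) for p, f in zip(forecast, frequencies))
    print(forecast, frequencies, "reliable" if gap < 1e-9 else "not reliable")
```

In this toy archive the forecast (0.8, 0.2) is matched by the observed
frequencies, whereas the forecast (0.5, 0.5) is not, so the scheme as a whole
violates condition (1).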



Let us turn our attention to scoring rules (see for example Matheson and
Winkler, 1976; Gneiting and Raftery, 2007). A scoring rule is a function
$S(p, y)$ which takes a probability assignment over $E$ as its first
argument and an element of $E$ as its second argument. For any two
probability assignments $p$ and $q$, the scoring function is defined as

$$ s(p, q) = \sum_{k \in E} S(p, k)\, q_k. \tag{2} $$

The interpretation of the scoring function is that if $Z$ is a random
variable with distribution $q$, then $s(p, q)$ is the mathematical
expectation of the score of the assignment $p$ in forecasting $Z$. It is our
convention that a small score indicates a good forecast. A score is called
proper if the divergence

$$ d(p, q) = s(p, q) - s(q, q) \tag{3} $$

is nonnegative, and it is called strictly proper if $d(p, q) = 0$ implies
$p = q$. The interpretation of $d(p, q)$ as a divergence is obviously
meaningful only if the scoring rule is strictly proper. From now on, I
assume that scoring rules are strictly proper. It is important to note that
$d(p, q)$ is, in general, not a metric, as it is neither symmetric nor does
it fulfil the triangle inequality. The quantity

$$ e(p) = s(p, p) \tag{4} $$

is called the entropy of $p$.^3 Table 1 gives a couple of frequently used
scoring rules along with the corresponding divergences and entropies.


For strictly proper scores,

$$ e(p) = \inf_{q} s(q, p). \tag{5} $$
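
The definitions in Equations (2) to (5) can be made concrete with a short
Python sketch. It is only an illustration under stated assumptions: the
function names are hypothetical, categories are labelled from 0, and the
quadratic score is used in the multi-category form
$\sum_k (p_k - \delta_{ky})^2$, which for two categories equals twice the
Brier score.

```python
import math

def quadratic(p, y):
    """Multi-category quadratic score: sum_k (p_k - 1{k == y})^2 (an assumed form)."""
    return sum((pk - float(k == y)) ** 2 for k, pk in enumerate(p))

def ignorance(p, y):
    """Ignorance (logarithmic) score: -log p_y; requires p_y > 0."""
    return -math.log(p[y])

def s(S, p, q):
    """Scoring function, Equation (2): expected score of p when the target has distribution q."""
    return sum(S(p, k) * qk for k, qk in enumerate(q))

def d(S, p, q):
    """Divergence, Equation (3)."""
    return s(S, p, q) - s(S, q, q)

def e(S, p):
    """Entropy, Equation (4)."""
    return s(S, p, p)

p = (0.2, 0.5, 0.3)
q = (0.1, 0.6, 0.3)
for S in (quadratic, ignorance):
    print(S.__name__, "d(p, q) =", d(S, p, q), "d(q, q) =", d(S, q, q), "e(q) =", e(S, q))

# Equation (5): no assignment r on a coarse grid achieves s(r, q) below e(q).
grid = [(i / 10, j / 10, (10 - i - j) / 10)
        for i in range(1, 9) for j in range(1, 10 - i)]
for S in (quadratic, ignorance):
    assert min(s(S, r, q) for r in grid) >= e(S, q) - 1e-12
```

For the quadratic score the divergence is the squared Euclidean distance
between $p$ and $q$; for the Ignorance it is the Kullback-Leibler divergence.
Both are positive for $p \neq q$ and vanish for $p = q$.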



3 Gneiting and Raftery (2007) refer to $-e(p)$ as either the generalised
entropy function or the information measure, but since entropy is commonly
interpreted as a lack of information, I define $e(p)$ to be the entropy.

Name          scoring rule S(p, y)              divergence d(q, p)               entropy e(p)

Brier^a       $|y - p|^2$                       $|p - q|^2$                      $p(1 - p)$
Ignorance^b   $-\log p_y$                       $-\sum_k \log(q_k / p_k)\, p_k$  $-\sum_k \log(p_k)\, p_k$
CRPS^c        $\int (F(z) - H(z - y))^2\, dz$   $\int (F(z) - G(z))^2\, dz$      $\int F(z)(1 - F(z))\, dz$
PSS^d         [...]                             [...]                            $-\|p\|_\alpha$
PLS^e         [...]                             [...]                            $-\sum_k p_k^2$

a For binary cases (i.e. $E = \{0, 1\}$).
b Propriety follows from Jensen's inequality.
c Continuous Ranked Probability Score. Here $F$ and $G$ are the cumulative
distribution functions corresponding to $p$ and $q$, respectively, and $H$
is the Heaviside step function.
d Pseudo-spherical Scores. Here $\alpha > 1$, while
$\|p\|_\alpha = (\sum_k p_k^\alpha)^{1/\alpha}$. Propriety follows from
Hölder's inequality.
e Proper Linear Score, also referred to as the quadratic score. For binary
cases (i.e. $E = \{0, 1\}$), this score is equivalent to the Brier score.

Table 1: Scoring rule, divergence, and entropy for several common scores.
All sums extend over $E$. See Epstein (1969); Murphy (1971) for a discussion
of the Ranked Probability Score. Matheson and Winkler (1976); Gneiting and
Raftery (2007) discuss scoring rules for continuous variables.



Since $s(q, p)$ is linear in $p$, Equation (5) demonstrates that for
strictly proper scores, the entropy $e(p)$ is an infimum over linear
functions and hence concave (Rockafellar, 1970). For the particular cases
listed in Table 1 it should be fairly obvious that the entropy is a measure
of the uncertainty inherent in a probability assignment $p$. For the Brier
score and the Ignorance, the entropy is indeed a very common measure of the
inherent randomness of a distribution. Furthermore, suppose $p$ and $q$ are
two probability assignments featuring the same entropy. Then intuitively,
any mixture of $p$ and $q$ should have at least as large an inherent
uncertainty as either of the individual probability assignments, an
intuition which the entropy supports, due to the concavity of $e(p)$.
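
The mixture argument can be checked numerically. The following Python sketch
is an illustration only: the two assignments are hypothetical, and the
entropies are those of the Ignorance and of the quadratic score in the assumed
multi-category form $\sum_k (p_k - \delta_{ky})^2$.

```python
import math

def shannon_entropy(p):
    """Entropy belonging to the Ignorance score: -sum_k p_k log p_k."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def quadratic_entropy(p):
    """Entropy belonging to the quadratic score: sum_k p_k (1 - p_k)."""
    return sum(pk * (1 - pk) for pk in p)

p = (0.7, 0.2, 0.1)
q = (0.1, 0.2, 0.7)  # a permutation of p, hence (up to rounding) the same entropy

gaps = []
for entropy in (shannon_entropy, quadratic_entropy):
    for lam in (0.25, 0.5, 0.75):
        mixture = tuple(lam * a + (1 - lam) * b for a, b in zip(p, q))
        # Concavity: entropy of the mixture minus the mixture of the entropies.
        gap = entropy(mixture) - (lam * entropy(p) + (1 - lam) * entropy(q))
        gaps.append(gap)
        print(entropy.__name__, lam, gap)
```

All printed gaps are positive, so every mixture considered has a larger
entropy than the two assignments it was built from.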

Our aim now is to derive a decomposition of the expected score
$\mathbb{E}[S(\gamma, Y)]$ of the forecasting scheme $\gamma$. Since
$\gamma$ is random, the expectation affects both $\gamma$ and $Y$. An
elementary property of the mathematical expectation gives

$$ \mathbb{E}[S(\gamma, Y)] = \mathbb{E}\big[\mathbb{E}[S(\gamma, Y) \mid \gamma]\big]. \tag{6} $$

To calculate the conditional expectation
$\mathbb{E}[S(\gamma, Y) \mid \gamma]$, the probability of $Y$ given
$\gamma$ is needed, but this is just $\pi^\gamma$, whence

$$ \mathbb{E}[S(\gamma, Y) \mid \gamma] = s(\gamma, \pi^\gamma). \tag{7} $$

Substituting Equation (7) into (6) results in

$$ \mathbb{E}[S(\gamma, Y)] = \mathbb{E}\, s(\gamma, \pi^\gamma). \tag{8} $$

From Equations (3) and (4) we get

$$ s(\gamma, \pi^\gamma) = e(\pi^\gamma) + d(\gamma, \pi^\gamma). \tag{9} $$

Taking the expectation on both sides of Equation (9) and substituting for
the right hand side in (8), we obtain

$$ \mathbb{E}[S(\gamma, Y)] = \mathbb{E}\, e(\pi^\gamma) + \mathbb{E}\, d(\gamma, \pi^\gamma). \tag{10} $$

The first term in Equation (10), the average entropy of $\pi^\gamma$, can be
decomposed further. Consider the (nonrandom) assignment obtained by taking
the average of $\pi^\gamma$,

$$ \bar\pi := \mathbb{E}\, \pi^\gamma. \tag{11} $$

It is easily seen that $\bar\pi$ is just the unconditional probability of
$Y$, which in meteorology is often referred to as the climatology of $Y$.
Since $s(\bar\pi, \pi^\gamma)$ is linear in $\pi^\gamma$ and $\bar\pi$ is
not random, it follows immediately from Equation (11) that

$$ \mathbb{E}\, s(\bar\pi, \pi^\gamma) = s(\bar\pi, \bar\pi) = e(\bar\pi). \tag{12} $$

Adding and subtracting $\mathbb{E}\, s(\bar\pi, \pi^\gamma)$ on the right
hand side of Equation (10) and using Equation (12) we arrive at

$$ \mathbb{E}[S(\gamma, Y)] = e(\bar\pi) - \mathbb{E}\, d(\bar\pi, \pi^\gamma) + \mathbb{E}\, d(\gamma, \pi^\gamma). \tag{13} $$

Equation (13) constitutes the desired decomposition of the expected score of
the probabilistic forecasting scheme $\gamma$. This decomposition is, as I
will argue, completely analogous to and a generalisation of the well known
decomposition of the Brier score. The three terms in Equation (13) will be
(from left to right) referred to as the uncertainty of $Y$, the resolution
term,^4 and the reliability term. As a starting point for the discussion of
the decomposition (13), the reader might want to verify (with the help of
Table 1) that for the Brier score, Equation (13) indeed yields the known
decomposition. Firstly, the uncertainty of $Y$ is the entropy of the
climatology and hence can be interpreted as the expected score of the
climatology as a forecast, quantifying the ability of the climatology to
forecast random draws from itself. The resolution term
$\mathbb{E}\, d(\bar\pi, \pi^\gamma)$ enters Equation (13) with a negative
sign and thus improves (that is, reduces) the score. Note that due to the
strict propriety of the score, the resolution term is nonnegative and
vanishes only if $\pi^\gamma = \bar\pi$ with probability one. Since the
resolution term describes the average deviation of $\pi^\gamma$ from its
average $\bar\pi$ (see Equation 11), it can be interpreted as a form of
variance of $\pi^\gamma$. The larger this variance, the better the score.
This term reduces to the standard variance of $\pi^\gamma$ in the case of
the Brier score. Finally, the reliability term (which is again nonnegative)
describes the average deviation of $\gamma$ from $\pi^\gamma$. Recalling
that $\gamma = \pi^\gamma$ indicates a reliable forecast, the interpretation
of the reliability term as the average violation of reliability becomes
obvious.
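
A quick numerical check of the decomposition (13) is sketched below in
Python. The three-category toy scheme, its probabilities, and all function
names are assumptions chosen for illustration. The scheme is specified
directly through the probability with which each forecast value is issued and
the conditional distribution of the observation given that value.

```python
import math

def quadratic(p, y):
    """Multi-category quadratic score (assumed form): sum_k (p_k - 1{k == y})^2."""
    return sum((pk - float(k == y)) ** 2 for k, pk in enumerate(p))

def ignorance(p, y):
    """Ignorance score: -log p_y."""
    return -math.log(p[y])

def s(S, p, q):
    """Scoring function, Equation (2)."""
    return sum(S(p, k) * qk for k, qk in enumerate(q))

def d(S, p, q):
    """Divergence, Equation (3)."""
    return s(S, p, q) - s(S, q, q)

# Hypothetical scheme: (probability of issuing the forecast, forecast, P(Y | forecast)).
scheme = [
    (0.5, (0.7, 0.2, 0.1), (0.6, 0.3, 0.1)),
    (0.3, (0.2, 0.6, 0.2), (0.2, 0.5, 0.3)),
    (0.2, (0.1, 0.1, 0.8), (0.1, 0.2, 0.7)),
]

def decompose(S, scheme):
    K = len(scheme[0][1])
    # Climatology, Equation (11): average of the conditional distributions.
    clim = [sum(w * pi[k] for w, _, pi in scheme) for k in range(K)]
    expected_score = sum(w * s(S, g, pi) for w, g, pi in scheme)  # Equation (8)
    uncertainty = s(S, clim, clim)
    resolution = sum(w * d(S, clim, pi) for w, _, pi in scheme)
    reliability = sum(w * d(S, g, pi) for w, g, pi in scheme)
    return expected_score, uncertainty, resolution, reliability

for S in (quadratic, ignorance):
    score, unc, res, rel = decompose(S, scheme)
    print(S.__name__, score, "=", unc, "-", res, "+", rel)
```

For both scores the expected score agrees with uncertainty minus resolution
plus reliability up to rounding error, and both the resolution and the
reliability term are nonnegative.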

4 Also called the sharpness term.

3 A decomposition of the resolution term



The decomposition (13) demonstrates how the score changes if the forecast
scheme $\gamma$ changes, but so that $\pi^\gamma$ remains constant. In this
case, any deviation of $\gamma$ from $\pi^\gamma$ has adverse effects on the
score. But in general, changing $\gamma$ means that $\pi^\gamma$ changes,
too. Thus, changes in $\gamma$ usually entail changes in both the
reliability and the resolution term of the decomposition (13). The changes
in the resolution term are investigated in this section. The intuitive
interpretation of the resolution term is that it somehow measures the
average information content of the forecast scheme. In this section, I will
discuss the concept of forecast sufficiency, introduced by DeGroot and
Fienberg (1982). This concept formalises the notion of being "more or less
informed" and allows for a partial ordering of forecasting schemes. As will
be seen in this section, $\gamma_1$ will have at least the same resolution
as $\gamma_2$ if $\gamma_1$ is sufficient for $\gamma_2$. Thus, for reliable
forecasting schemes, the expected score reproduces the same ordering as
sufficiency. This result establishes a connection between a quantitative
notion of information as provided by the score, and a qualitative notion of
information content as provided by sufficiency. This is analogous to the
relation between the reliability term of the decomposition (13) and the
qualitative reliability condition (1).

I call a forecasting scheme $\gamma_1$ sufficient for a forecasting scheme
$\gamma_2$ if

$$ \pi^2 = \mathbb{E}[\pi^1 \mid \gamma_2], \tag{14} $$

where the abbreviations
$\pi^1 := \pi^{\gamma_1} = \mathbb{P}(Y \mid \gamma_1)$ and analogously for
$\pi^2$ were used.^5 In Appendix B it is shown that the present notion of
sufficiency is equivalent to the corresponding definition of DeGroot and
Fienberg (1982). Before continuing with score decompositions, let me try to
elucidate the rather technical condition (14) with a somewhat informal
interpretation. Suppose the forecaster who is running forecasting scheme
$\gamma_1$, albeit having no access to the current value of $\gamma_2$, has
collected a large archive of past values of $\gamma_2$ and hence is able to
fit a good approximation to $\mathbb{P}(\gamma_2 \mid \gamma_1)$. With this
information, he tries to mimic forecasting scheme $\gamma_2$ as follows. The
forecaster's mimicry version of $\gamma_2$ (which we denote by $\gamma_2'$)
is just a random draw from $\mathbb{P}(\gamma_2 \mid \gamma_1)$ (conditioned
on his own forecast $\gamma_1$). Since the expected score of any forecast
scheme depends only on the joint distribution of the forecast scheme and
$Y$, the mimicry forecast $\gamma_2'$ will achieve the same expected score
as the real $\gamma_2$ if the joint distributions of $(\gamma_2, Y)$ and
$(\gamma_2', Y)$ are the same. It is straightforward to work out that the
latter condition is equivalent to (14). In brief, if $\gamma_1$ is
sufficient for $\gamma_2$, then by appropriate randomisation of $\gamma_1$,
a forecast $\gamma_2'$ is obtained which has the same statistical properties
as $\gamma_2$. Note also that in particular $\gamma_1$ is sufficient for
$\gamma_2$ if $\gamma_2$ can be written as a function of $\gamma_1$.
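
Condition (14) can be checked mechanically once the joint distribution of the
two forecasting schemes and the observation is known. The Python sketch below
assumes, purely for illustration, that this joint distribution is given as a
table keyed by (value of the first scheme, value of the second scheme,
category of the observation); forecast values are represented by arbitrary
labels, and all names and numbers are hypothetical. It contrasts a second
scheme that is a function of the first with an "oracle" that simply reveals
the observation.

```python
from collections import defaultdict

def conditional_of_y(joint, index, K):
    """P(Y = k | scheme at key position `index`) from a table {(g1, g2, k): prob}."""
    mass = defaultdict(lambda: [0.0] * K)
    for key, prob in joint.items():
        mass[key[index]][key[2]] += prob
    return {g: [m / sum(ms) for m in ms] for g, ms in mass.items()}

def is_sufficient(joint, K, tol=1e-12):
    """Check Equation (14), pi^2 = E[pi^1 | gamma_2], with gamma_1 at key position 0."""
    pi1 = conditional_of_y(joint, 0, K)
    pi2 = conditional_of_y(joint, 1, K)
    num = defaultdict(lambda: [0.0] * K)  # sum over g1 of P(g1, g2) * pi^1(g1)
    den = defaultdict(float)              # P(g2)
    for (g1, g2, _), prob in joint.items():
        den[g2] += prob
        for k in range(K):
            num[g2][k] += prob * pi1[g1][k]
    return all(abs(num[g2][k] / den[g2] - pi2[g2][k]) < tol
               for g2 in den for k in range(K))

# Hypothetical first scheme: three forecast values with P(Y | value).
p_g1 = {"A": 0.5, "B": 0.3, "C": 0.2}
pi1 = {"A": (0.9, 0.1), "B": (0.5, 0.5), "C": (0.2, 0.8)}

# Second scheme as a function of the first (values A and B are merged).
coarse = {(a, "AB" if a in "AB" else "C", k): p_g1[a] * pi1[a][k]
          for a in p_g1 for k in range(2)}
# Second scheme that reveals the observation itself.
oracle = {(a, k, k): p_g1[a] * pi1[a][k] for a in p_g1 for k in range(2)}

print("coarse:", is_sufficient(coarse, K=2))
print("oracle:", is_sufficient(oracle, K=2))
```

The first scheme is sufficient for its coarsened version but not for the
oracle, which uses information about the observation that the first scheme
does not have.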

5 If both $\gamma_1$ and $\gamma_2$ are reliable, then condition (14)
becomes $\gamma_2 = \mathbb{E}[\gamma_1 \mid \gamma_2]$. In this situation,
$\gamma_1$ is said to be at least as refined as $\gamma_2$.

In Appendix C it is shown that if $\gamma_1$ is sufficient for $\gamma_2$,
it holds that

$$ \mathbb{E}\, d(\bar\pi, \pi^2) = \mathbb{E}\, d(\bar\pi, \pi^1) - \mathbb{E}\, d(\pi^2, \pi^1). \tag{15} $$

Keeping in mind that $\mathbb{E}\, d(\bar\pi, \pi^1)$ and
$\mathbb{E}\, d(\bar\pi, \pi^2)$ are the resolution terms of $\gamma_1$ and
$\gamma_2$, respectively, and that $d(\cdot, \cdot)$ is never negative,
Equation (15) demonstrates that the resolution of $\gamma_2$ will be at most
that of $\gamma_1$. More generally, Equations (15) and (13) together allow
for the following conclusions as to the approach of scoring forecasting
schemes using strictly proper scores:



• The forecasting scheme $\pi^\gamma$ achieves the best possible average
score among all forecasts for which $\gamma$ is sufficient. If the score is
strictly proper, it is uniquely defined through this optimum property, in
the sense that any forecast for which $\gamma$ is sufficient is either equal
to $\pi^\gamma$ or it will have a worse average score. This can be
considered an answer to the conjectured sharpness principle of Gneiting et
al. (2005a), reinterpreted in our framework.

• Per se, it is impossible to say how the score will rank unreliable
forecast schemes, even if one is sufficient for the other. The lack of
reliability of one forecast scheme might be outweighed by the lack of
resolution of the other.

• It is also not clear how the score will rank forecast schemes (reliable or
unreliable) as long as neither of the two forecast schemes is sufficient for
the other. It seems plausible that the actual ranking of such forecasts will
depend on the particular scoring rule employed.
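
Equation (15) can likewise be checked on a small example. The Python sketch
below assumes a hypothetical scheme $\gamma_1$ with three forecast values and
a binary target, and a scheme $\gamma_2$ that is a function of $\gamma_1$ (it
merges the first two values), so that $\gamma_1$ is sufficient for
$\gamma_2$. All names and numbers are illustrative assumptions, and the
quadratic score is again used in an assumed multi-category form.

```python
import math

def quadratic(p, y):
    """Multi-category quadratic score (assumed form): sum_k (p_k - 1{k == y})^2."""
    return sum((pk - float(k == y)) ** 2 for k, pk in enumerate(p))

def ignorance(p, y):
    """Ignorance score: -log p_y."""
    return -math.log(p[y])

def d(S, p, q):
    """Divergence d(p, q) = s(p, q) - s(q, q), with s as in Equation (2)."""
    return sum((S(p, k) - S(q, k)) * qk for k, qk in enumerate(q))

def average(weighted):
    """Probability-weighted mean of a list of (weight, vector) pairs."""
    total = sum(w for w, _ in weighted)
    K = len(weighted[0][1])
    return tuple(sum(w * v[k] for w, v in weighted) / total for k in range(K))

# Rows: (probability of this value of gamma_1, pi^1 for that value, value of gamma_2).
rows = [(0.5, (0.9, 0.1), "AB"), (0.3, (0.5, 0.5), "AB"), (0.2, (0.2, 0.8), "C")]

climatology = average([(w, pi1) for w, pi1, _ in rows])
# pi^2 from Equation (14): average of pi^1 over all rows sharing the same gamma_2.
pi2 = {g: average([(w, pi1) for w, pi1, h in rows if h == g]) for g in ("AB", "C")}

results = {}
for S in (quadratic, ignorance):
    resolution_1 = sum(w * d(S, climatology, pi1) for w, pi1, _ in rows)
    resolution_2 = sum(w * d(S, climatology, pi2[g]) for w, _, g in rows)
    loss = sum(w * d(S, pi2[g], pi1) for w, pi1, g in rows)
    results[S.__name__] = (resolution_1, resolution_2, loss)
    print(S.__name__, resolution_2, "=", resolution_1, "-", loss)
```

For both scores the resolution of the coarser scheme equals the resolution of
the finer scheme minus the nonnegative term $\mathbb{E}\, d(\pi^2, \pi^1)$,
as stated by Equation (15).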



4 Conclusion 



The score of a probabilistic forecast was shown to decompose into terms
related to the uncertainty in the observation, the resolution of the
forecast, and its reliability, generalising corresponding results for the
Brier score. The only property required of the score is that it be strictly
proper. By using a widely accepted characterisation of reliability, and
furthermore by generalising the concepts of sufficiency and refinement due
to DeGroot and Fienberg (1982), it was argued that both the resolution and
the reliability term in the decomposition quantify forecast attributes for
which the case can be made independently (i.e. not referring to scoring
rules). These results provide an epistemological justification of measuring
forecast quality by proper scores. Furthermore, the relation to the
conjectured sharpness principle of Gneiting et al. (2005a) was mentioned.

Appendix



A An alternative definition of reliability 

In this section, it will be shown that any conditional probability is
reliable. The reader is assumed to be familiar with the basic notions of
probability theory (see e.g. Breiman, 1973, chapter 4). Let $\gamma$ be a
probabilistic forecasting scheme which can be written as a conditional
probability, that is

$$ \mathbb{P}(Y = k \mid \mathcal{F}) = \gamma_k \tag{16} $$

for all $k \in E$ and some sigma algebra $\mathcal{F}$. On both sides of
Equation (16), we take the mathematical expectation conditioned on $\gamma$.
The right hand side gives back $\gamma_k$. To compute the left hand side,
note that because of Equation (16), $\gamma$ is $\mathcal{F}$-measurable.
Hence

$$ \begin{aligned} \mathbb{E}\big[\mathbb{P}(Y = k \mid \mathcal{F}) \mid \gamma\big] &= \mathbb{E}\big[\mathbb{E}[\delta_{Y,k} \mid \mathcal{F}] \mid \gamma\big] \\ &= \mathbb{E}[\delta_{Y,k} \mid \gamma] \\ &= \mathbb{P}(Y = k \mid \gamma). \end{aligned} \tag{17} $$

This demonstrates that $\mathbb{P}(Y = k \mid \gamma) = \gamma_k$, which is
the condition for reliability.



B Sufficiency and refinement of DeGroot and Fienberg

Let $\gamma_1$, $\gamma_2$ and $\pi^1$, $\pi^2$ be as in Section 3. With
these definitions, $\gamma_1$ is sufficient for $\gamma_2$ if
$\pi^2 = \mathbb{E}[\pi^1 \mid \gamma_2]$. It will now be shown that this is
equivalent to the sufficiency condition given by DeGroot and Fienberg
(1982), Equation (4.3). To state the latter condition, I assume that the
conditional probability of $\gamma_1$ given $Y$ and the conditional
probability of $\gamma_2$ given $Y$ have densities $g_1(p \mid Y)$ and
$g_2(p \mid Y)$, respectively. Furthermore, the conditional probability of
$\gamma_2$ given $\gamma_1$ is assumed to have a density
$h(\gamma_2 \mid \gamma_1)$. With these conventions, $\gamma_1$ is
sufficient for $\gamma_2$ in the sense of DeGroot and Fienberg (1982) if

$$ g_2(\gamma_2 \mid Y) = \int h(\gamma_2 \mid \gamma_1)\, g_1(\gamma_1 \mid Y)\, d\gamma_1. \tag{18} $$

Multiplying both sides by $\bar\pi$ and dividing by the density of
$\gamma_2$ we obtain

$$ \pi^2(\gamma_2) = \int_{\mathcal{P}_E} \pi^1(\gamma_1)\, f(\gamma_1 \mid \gamma_2)\, d\gamma_1, \tag{19} $$

with $f(\gamma_1 \mid \gamma_2)$ being the conditional density of $\gamma_1$
given $\gamma_2$. Here we need to write explicitly that $\pi^1$ and $\pi^2$
depend on $\gamma_1$ and $\gamma_2$, respectively. But the right hand side
of Equation (19) is just $\mathbb{E}[\pi^1 \mid \gamma_2]$.

C Derivation of Equation (15)



Still, $\gamma_1$, $\gamma_2$ and $\pi^1$, $\pi^2$ are as in Section 3. By
just applying definitions, we get

$$ \begin{aligned} d(\bar\pi, \pi^2) &= s(\bar\pi, \pi^2) - s(\pi^2, \pi^2) \\ &= s(\bar\pi, \pi^2) - s(\pi^1, \pi^1) - \big(s(\pi^2, \pi^2) - s(\pi^1, \pi^1)\big). \end{aligned} \tag{20} $$

The mathematical expectation of the first term can be written as

$$ \begin{aligned} \mathbb{E}\, s(\bar\pi, \pi^2) &= \mathbb{E}\big[\mathbb{E}[S(\bar\pi, Y) \mid \gamma_2]\big] \\ &= \mathbb{E}\big[\mathbb{E}[S(\bar\pi, Y) \mid \gamma_1]\big] \\ &= \mathbb{E}\, s(\bar\pi, \pi^1), \end{aligned} \tag{21} $$

using elementary properties of the conditional expectation and the fact that
$\bar\pi$ is not random. Next, the mathematical expectation of the third
term is considered:

$$ \begin{aligned} \mathbb{E}\, s(\pi^2, \pi^2) &= \mathbb{E}\big[s(\pi^2, \mathbb{E}[\pi^1 \mid \gamma_2])\big] \\ &= \mathbb{E}\big[\mathbb{E}[s(\pi^2, \pi^1) \mid \gamma_2]\big] \\ &= \mathbb{E}\, s(\pi^2, \pi^1). \end{aligned} \tag{22} $$

The first equality is due to sufficiency; the second is valid because
$\pi^2$ is a function of $\gamma_2$, so it can be taken under any
expectation conditioned on $\gamma_2$ (recall that $s$ is linear in its
second argument); and the third equality uses elementary properties of the
conditional expectation. Taking the expectation over Equation (20) and using
Equations (21) and (22), we obtain Equation (15).